Zurich by the Numbers - Predictive Insights into Tourism Dynamic
Authors
Affiliation
Name I, First Name I
University of Lausanne
Name II, First Name II
Published
May 1, 2024
Abstract
The following Forecasting project focuses on applying forecasting techniques to predict tourism trends in Zurich. This analysis aims to harness the power of historical data combined with forecasting algorithms to provide actionable insights into future tourism patterns. We engage in comprehensive data preparation, explore various predictive models, and conduct a detailed evaluation of their forecasting accuracy. The project encapsulates the challenge of turning complex data into understandable and strategic information, crucial for effective decision-making in Zurich’s tourism sector.
1 Exploration & Visualization
1.1 Objectives
The main objectives of this project is to predict :
The overnight stays of the visitors in Vaud, from October 2023 until December 2024.
The overnight satys of visitors from Philippines to Zürich, from October 2023 until December 2024.
2 DATA
2.1 Cleaning & Wrangling
2.1.1 Tourism Data - All
The dataset contains information about the overnights stays by tourists in the various Swiss cantons. It indicates the tourist’s country of origin, the canton of stay, the month, the year and the total number of overnights stays.
As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.
Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")#print unique values in month columnunique(tourism_data$Monat)#> [1] "Januar" "Februar" "März" "April" "Mai" #> [6] "Juni" "Juli" "August" "September" "Oktober" #> [11] "November" "Dezember"# change ' [1] "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli" "August" "September" "Oktober" "November" "Dezember" into english month'tourism_data$Monat <- tourism_data$Monat %>%recode_factor("Januar"="January","Februar"="February","März"="March","April"="April","Mai"="May","Juni"="June","Juli"="July","August"="August","September"="September","Oktober"="October","November"="November","Dezember"="December")#add date type column for plotting purposestourism_data <- tourism_data %>%mutate(Date =dmy(paste("01", Monat, Jahr)))# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the totaltourism_data_no_total <- tourism_data %>%filter(Herkunftsland !="Herkunftsland - Total")#check for NANsum(is.na(tourism_data_no_total))#> [1] 51395#analyse the NAN values, where are they(tourism_data_no_total %>%filter(is.na(value)))#> # A tibble: 51,395 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Schwe~ Janu~ 2005 NA 2005-01-01#> 2 Zypern Schwe~ Janu~ 2005 NA 2005-01-01#> 3 Mexiko Schwe~ Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005 NA 2005-01-01#> 5 Bahrain Schwe~ Janu~ 2005 NA 2005-01-01#> 6 Katar Schwe~ Janu~ 2005 NA 2005-01-01#> 7 Kuwait Schwe~ Janu~ 2005 NA 2005-01-01#> 8 Australien Schwe~ Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Schwe~ Janu~ 2005 NA 2005-01-01#> 10 Oman Schwe~ Janu~ 2005 NA 2005-01-01#> # i 51,385 more rows#show data using reactable only showing the first 100 rowsreactable::reactable(head(tourism_data_no_total, 1000))
2.1.2 Tourism Data - Vaud
Click to show code
# Filter by canton Vaud tourism_vaud <- tourism_data %>%filter(Kanton =="Vaud")#check for NANsum(is.na(tourism_vaud))#> [1] 1869#show the data in a table using reactablereactable::reactable(head(tourism_vaud, 20))
2.1.3 Tourism Data - Zurich
We filtered the ‘Canton’ column to keep only the canton of Zürich.
Click to show code
#filter column 'Kanton' for Zurichtourism_data_zurich <- tourism_data_no_total %>%filter(Kanton =="Zürich")#check for NANsum(is.na(tourism_data_zurich))#> [1] 1869#analyse the NAN values, where are theytourism_data_zurich %>%filter(is.na(value))#> # A tibble: 1,869 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Zürich Janu~ 2005 NA 2005-01-01#> 2 Zypern Zürich Janu~ 2005 NA 2005-01-01#> 3 Mexiko Zürich Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005 NA 2005-01-01#> 5 Bahrain Zürich Janu~ 2005 NA 2005-01-01#> 6 Katar Zürich Janu~ 2005 NA 2005-01-01#> 7 Kuwait Zürich Janu~ 2005 NA 2005-01-01#> 8 Australien Zürich Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Zürich Janu~ 2005 NA 2005-01-01#> 10 Oman Zürich Janu~ 2005 NA 2005-01-01#> # i 1,859 more rows#show the data in a table using reactablereactable::reactable(head(tourism_data_zurich, 1000))
2.1.4 Tourism Data - Zurich and Philipines
Same as above, but we’re also filtering the country of origin, keeping only ‘Philippinen’ as that is what we’re aiming for.
Click to show code
tourism_data_zurich_philippines <- tourism_data_zurich %>%filter(Herkunftsland =="Philippinen")#show table using reactablereactable::reactable(tourism_data_zurich_philippines)
2.1.5 Deal with NAN
We have none in the data filtered with zurich and philippines, but if we would have we would :
2.1.5.1 Impute missing values ARIMA
If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.
Click to show code
# #Creating a tsibble with missing values# data <- tourism_data_zurich_philippines %>%# as_tsibble(key = c(Kanton, Herkunftsland, Monat, Jahr)) %>%# select(Date, value) %>%# fill_gaps()# # # Fit an ARIMA model to data with missing values# model_fit <- data %>%# model(ARIMA(value))# # # Interpolate missing values using the fitted ARIMA model# filled_data <- model_fit %>%# interpolate(data)# # # Print the data with filled in missing values# print(filled_data)
3 EDA - Vaud
3.1 Visitors from different countries in Vaud
Click to show code
plot_vaud <- tourism_vaud %>%filter(Herkunftsland !='Herkunftsland - Total') %>%ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,text =paste("Country:", Herkunftsland, "Trips:", value))) +# Added text for tooltipgeom_line(show.legend =FALSE) +labs(title ="Number of visitors from Each Country to Vaud",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly objectinteractive_plot <-ggplotly(plot_vaud, tooltip ="text")# Adjust plotly settings interactive_plot <- interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80), # Adjust marginslegend =list(orientation ="h", x =0, xanchor ="left", y =-0.2)) # Adjust legend position# Display the interactive plotinteractive_plot
According to the graph the most tourist in canton Vaud are Swiss and France.
Graphical view of total number of tourists in canton Vaud
Click to show code
head(tourism_vaud)#> # A tibble: 6 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Herkunftsland - Total Vaud January 2005 57942 2005-01-01#> 2 Schweiz Vaud January 2005 27853 2005-01-01#> 3 Baltische Staaten Vaud January 2005 103 2005-01-01#> 4 Deutschland Vaud January 2005 3052 2005-01-01#> 5 Frankreich Vaud January 2005 7667 2005-01-01#> 6 Italien Vaud January 2005 2154 2005-01-01tourism_vaud_total <- tourism_vaud %>%filter(Herkunftsland =='Herkunftsland - Total') %>%select(-c(Herkunftsland, Kanton, Monat, Jahr))tourism_vaud_total%>%ggplot(aes(x = Date, y = value)) +geom_line() +labs(x ="Date", y ="Number of tourists", title ="Total number of tourists in canton Vaud") +theme_minimal()
3.1.1 Decomposition
Click to show code
# Convert data to a time series objectvaud_ts <- tourism_vaud_total %>%arrange(Date) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date), max(Date), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date))))# Decompose the time seriesvaud_ts %>%decompose() %>%plot()
We see clear seasonality with picks in summer and dip in January.
We saw a big drop in March 2020 due to the global coronavirus pandemic.
4 EDA - Zurich
4.1 Zurich and All visiting countries
Click to show code
# Preparing the data#removing value 'Schweiz' in column 'Herkunftsland' as it is just the whole of Switzerlanddata <- tourism_data_zurich %>%filter(!is.na(value)) %>%# Removing rows with NA values in the 'value' columnmutate(Monat =month(Date, label =TRUE, abbr =TRUE), # Extract month from DateJahr =year(Date)) %>%# Extract year from Dategroup_by(Herkunftsland, Date) %>%# Group by country and datesummarise(Trips =sum(value), .groups ='drop') # Summing up trips for each country per datep <-ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,color = Herkunftsland =="Philippinen",text =paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +# Added text for tooltipgeom_line(show.legend =FALSE) +scale_color_manual(values =c("TRUE"="red", "FALSE"="grey")) +labs(title ="Number of Trips from Each Country to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly objectinteractive_plot <-ggplotly(p, tooltip ="text")# Adjust plotly settings interactive_plot <- interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80), # Adjust marginslegend =list(orientation ="h", x =0, xanchor ="left", y =-0.2)) # Adjust legend position# Display the interactive plotinteractive_plot
4.2 Zurich and Philipines Visitors
This graph shows changes in the number of overnight stays between January 2005 and September 2023.
Germans are the most frequent visitors to Zurich. From 2022 onwards, the United States will be the first country to visit Zurich. The red curve representing the Philippines is flat and very close to 0 (little volatility), indicating a considerably lower number of trips from this country to Zurich over the entire period observed.
A drastic drop can be observed. This could be due to the COVID-19 pandemic, which had a significant impact on international travel. At first glance, we can observe regular seasonal peaks for most countries, suggesting the presence of seasonality in tourism to the canton of Zurich.
4.3 Zurich and Philippinens Visitors
Click to show code
# use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axisp <-ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +geom_line() +labs(title ="Number of Trips from Philipines to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()p
Slight seasonality
General upward trend in the number of nights spent before 2020.
Very sharp fall in 2020 - covid
General upward trend after the fall
4.3.1 Pattern
4.3.1.1 Decompose
Click to show code
# Convert data to a time series objecttourism_ts <- tourism_data_zurich_philippines %>%arrange(Date) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date), max(Date), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date))))# Decompose the time seriesdecomposed <-decompose(tourism_ts)# Plot the decomposed componentsplot(decomposed)
4.3.1.2 Seasonality
Click to show code
# Plot the seasonality in one chart ggseasonplot(tourism_ts, year.labels =TRUE, year.labels.left =TRUE)
Click to show code
# several chart per month to see the seasonalityggsubseriesplot(tourism_ts) +ylab("Number of tourists") +xlab("Month") +ggtitle("Seasonal subseries plot")#debug#better to use gg_subseries to see the seasonality#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")
4.3.1.3 Trend
Click to show code
# Show the trend using moving average# tourism_data_zurich_philippines %>%# autoplot(value) +# autolayer(ma(value, 12), series = "12-month moving average") +# labs(title = "Trend in the number of tourists from Philipines to Zurich",# x = "Date",# y = "Number of tourists") +# theme_minimal()
5 MODELLING
This part is about building on your knowledge of time series techniques to model your data. You can investigate various models but you should justify in your report your choices regarding these. Pay attention to the conditions that are needed to apply a specific model. Treat also carefully seasonality, outliers, colinearity, covariates, special events, etc. Remember the following steps:
Aggregation choice for hierarchical time series
Model building
Model selection
5.1 Total number of visitors in Vaud
5.1.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?
Questions ?
5.1.2 ETS model
Click to show code
ets_vaud <-ets(vaud_ts, model ="AAA")forecast_ets_vaud <-forecast(ets_vaud, h =24) %>%plot(main ="Forecast of visitors in Vaud", xlab ="Date", ylab ="Number of visitors")
5.1.3 ARIMA
Click to show code
arima_vaud <-auto.arima(vaud_ts, seasonal =TRUE)# Generate forecasts for the next 2 years (24 months)forecast_arima_vaud <-forecast(arima_vaud, h =24)# Plot the forecastplot(forecast_arima_vaud, main ="ARIMA Forecast for Vaud Tourism", xlab ="Date", ylab ="Number of Tourists")
Question: Forecast for each country or general forecast?
5.2 Zurich and Philipines
5.2.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?